Metric Suffix Array For Large-Scale Similarity Search
Abstract
We propose the Metric Suffix Array (MSA), a novel and efficient data structure for permutation-based indexing. The MSA follows the same principles as the suffix array, which is mainly used for text indexing; here, we build the MSA as an alternative for large-scale content-based information retrieval. We also show how the MSA scales to parallel and distributed architectures. We study the performance and efficiency of our algorithms in a large-scale context. Experimental results show fast response times with high efficiency and effectiveness.
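For readers unfamiliar with the structure the abstract builds on, the following is a minimal sketch of a classical text suffix array, not of the MSA itself, whose construction is not described here. The function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive sort-based construction is chosen for clarity rather than speed.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text`, in sorted suffix order.

    Naive construction: sorts suffixes by direct comparison. Fine for
    illustration, too slow for large inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return all start positions of `pattern` in `text`, in increasing order.

    Suffixes that start with `pattern` form one contiguous block of the
    suffix array, so two binary searches locate the block boundaries.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)
    print(find_occurrences(text, sa, "ana"))
```

The point of the sketch is only that sorting suffixes turns substring lookup into a binary search over a contiguous range; how the MSA carries this idea over to permutation-based indexing in metric spaces is left to the paper.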
Similar resources
RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data
SUMMARY With the wide application of next-generation sequencing (NGS) techniques, fast tools for protein similarity search that scale well to large query datasets and large databases are highly desirable. In a previous work, we developed RAPSearch, an algorithm that achieved a ~20-90-fold speedup relative to BLAST while still achieving similar levels of sensitivity for short protein fragments d...
Compressed Spaced Suffix Arrays
Spaced seeds are important tools for similarity search in bioinformatics, and using several seeds together often significantly improves their performance. With existing approaches, however, for each seed we keep a separate linear-size data structure, either a hash table or a spaced suffix array (SSA). In this paper we show how to compress SSAs relative to normal suffix arrays (SAs) and still su...
Acceleration of spoken term detection using a suffix array by assigning optimal threshold values to sub-keywords
We previously proposed a fast spoken term detection method that uses a suffix array data structure for searching large-scale speech documents. The method reduces search time via techniques such as keyword division and iterative lengthening search. In this paper, we propose a statistical method of assigning different threshold values to sub-keywords to further accelerate search. Specifically, th...
Sequence Covering for Efficient Host-Based Intrusion Detection
This paper introduces a new similarity measure, the covering similarity, that we formally define for evaluating the similarity between a symbolic sequence and a set of symbolic sequences. A pair-wise similarity can also be directly derived from the covering similarity to compare two symbolic sequences. An efficient implementation to compute the covering similarity is proposed that uses a suffix...
CSA++: Fast Pattern Search for Large Alphabets
Indexed pattern search in text has been studied for many decades. For small alphabets, the FM-Index provides unmatched performance, in terms of both space required and search speed. For large alphabets – for example, when the tokens are words – the situation is more complex, and FM-Index representations are compact, but potentially slow. In this paper we apply recent innovations from the field ...
Publication date: 2013